PUMA: Planning Under Uncertainty with Macro-Actions

Authors

  • Ruijie He
  • Emma Brunskill
  • Nicholas Roy
Abstract

Planning in large, partially observable domains is challenging, especially when a long-horizon lookahead is necessary to obtain a good policy. Traditional POMDP planners that plan a different potential action for each future observation can be prohibitively expensive when planning many steps ahead. An efficient solution for planning far into the future in fully observable domains is to use temporally extended sequences of actions, or "macro-actions." In this paper, we present a POMDP algorithm for planning under uncertainty with macro-actions (PUMA) that automatically constructs and evaluates open-loop macro-actions within forward-search planning, where the planner branches on observations only at the end of each macro-action. Additionally, we show how to incrementally refine the plan over time, resulting in an anytime algorithm that provably converges to an ε-optimal policy. In experiments on several large POMDP problems which require a long-horizon lookahead, PUMA outperforms existing state-of-the-art solvers.

Most partially observable Markov decision process (POMDP) planners select actions conditioned on the prior observation at each timestep; we refer to such planners as fully-conditional. When good performance relies on considering different possible observations far into the future, both online and offline fully-conditional planners typically struggle. An extreme alternative is unconditional (or "open-loop") planning, where a sequence of actions is fixed and does not depend on the observations that will be received during execution. While open-loop planning can be extremely fast and perform surprisingly well in certain domains (for a discussion of open-loop planning for multi-robot tag, see Yu et al. 2005), acting well in most real-world domains requires plans where at least some action choices are conditional on the obtained observations.

This paper focuses on the significant subset of POMDP domains, including scientific exploration, target surveillance, and chronic care management, where it is possible to act well by planning using conditional sequences of open-loop, fixed-length action chains, or "macro-actions." We call this approach semi-conditional planning, in that actions are chosen based on the received observations only at the end of each macro-action.

We demonstrate that for certain domains, planning with macro-actions can offer performance close to fully-conditional planning at a dramatically reduced computational cost. In comparison to prior macro-action work, where a domain expert often hand-coded a good set of macro-actions for each problem, we present a technique for automatically constructing finite-length open-loop macro-actions. Our approach uses sub-goal states selected on the basis of immediate reward and potential information gain. We then describe how to incrementally refine an initial macro-action plan by incorporating successively shorter macro-actions. We combine these two contributions in a forward-search algorithm for planning under uncertainty with macro-actions (PUMA). PUMA is an anytime algorithm which guarantees eventual convergence to an ε-optimal policy, even in domains that may require close to fully-conditional plans. PUMA outperforms a state-of-the-art POMDP planner both in terms of plan quality and computational cost on two large simulated POMDP problems.
However, semi-conditional planning does not yield an advantage in all domains, and we provide preliminary experimental analysis towards determining a priori when planning in a semi-conditional manner will be helpful. Even in domains that are not well suited to semi-conditional planning, our anytime improvement allows PUMA to still eventually compute a good policy, suggesting that PUMA may be viable as a generic planner for large POMDP problems.
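To make the macro-action construction concrete, here is a minimal sketch in Python of scoring candidate sub-goal states by immediate reward together with a proxy for information gain. It assumes a small discrete POMDP given as a reward matrix R of shape |S|×|A| and per-action observation matrices Z[a] of shape |S|×|O|; the particular scoring heuristic and the helper names (score_subgoals, sample_subgoals) are illustrative assumptions, not PUMA's exact construction.

```python
# Illustrative sketch: scoring states as candidate sub-goals for
# macro-action construction, assuming a small discrete POMDP with
# reward matrix R (|S| x |A|) and observation matrices Z[a] (|S| x |O|).
# The heuristic below is a hypothetical proxy, not PUMA's exact criteria.
import numpy as np

def entropy(p):
    """Shannon entropy; a low-entropy observation row localizes the agent."""
    p = p[p > 0]
    return -(p * np.log(p)).sum()

def score_subgoals(R, Z, n_actions):
    """Score each state by its best immediate reward plus how informative
    observations are there (low observation entropy => more informative)."""
    reward_score = R.max(axis=1)              # best one-step reward per state
    info_score = np.array([
        min(entropy(Z[a][s]) for a in range(n_actions))
        for s in range(R.shape[0])
    ])
    return reward_score - info_score

def sample_subgoals(R, Z, n_actions, k, rng=None):
    """Sample k distinct sub-goal states, weighted by a softmax of scores."""
    rng = rng or np.random.default_rng(0)
    scores = score_subgoals(R, Z, n_actions)
    w = np.exp(scores - scores.max())
    return rng.choice(R.shape[0], size=k, replace=False, p=w / w.sum())
```

Given sampled sub-goals, each macro-action could then be formed as a short open-loop action sequence toward a sub-goal, for instance via a shortest path under the underlying MDP dynamics.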
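The semi-conditional search itself can be sketched as follows: roll the belief forward open-loop through each candidate macro-action, accumulating expected reward, and branch on observations only at the macro-action's end. The code below is a naive enumeration under the same discrete-POMDP assumptions (transition matrices T[a], observation matrices Z[a], reward matrix R); it is a stand-in for the approach described above, not the authors' implementation.

```python
# Illustrative sketch of semi-conditional forward search: open-loop belief
# propagation within a macro-action, observation branching only at its end.
import numpy as np

def predict(belief, a, T):
    """Action-only belief update: marginalize over observations."""
    return belief @ T[a]                      # b'(s') = sum_s b(s) T(s,a,s')

def update(belief, a, o, T, Z):
    """Full Bayes update after observing o at the end of a macro-action."""
    b = (belief @ T[a]) * Z[a][:, o]
    return b / b.sum() if b.sum() > 0 else b

def plan(belief, macros, T, Z, R, depth, gamma=0.95):
    """Return the best (macro-action, value) for the current belief,
    branching on observations only after each open-loop macro-action."""
    if depth == 0:
        return None, 0.0
    best_macro, best_val = None, -np.inf
    for macro in macros:                      # macro = tuple of primitive actions
        b, val, disc = belief.copy(), 0.0, 1.0
        for a in macro[:-1]:                  # open-loop segment: no branching
            val += disc * (b @ R[:, a])
            b = predict(b, a, T)
            disc *= gamma
        a = macro[-1]
        val += disc * (b @ R[:, a])
        for o in range(Z[a].shape[1]):        # branch on observations here only
            p_o = (b @ T[a]) @ Z[a][:, o]     # probability of observing o
            if p_o > 1e-12:
                _, v = plan(update(b, a, o, T, Z), macros, T, Z, R,
                            depth - 1, gamma)
                val += disc * gamma * p_o * v
        if val > best_val:
            best_macro, best_val = macro, val
    return best_macro, best_val
```

The savings over fully-conditional search come from the branching structure: with macro-actions of length k, a depth-h lookahead branches on observations only h/k times rather than h times.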

Similar articles

Decentralized control of multi-robot partially observable Markov decision processes using belief space macro-actions

This work focuses on solving general multi-robot planning problems in continuous spaces with partial observability given a high-level domain description. Decentralized Partially Observable Markov Decision Processes (Dec-POMDPs) are general models for multi-robot coordination problems. However, representing and solving Dec-POMDPs is often intractable for large problems. This work extends the Dec-P...

Full text

Efficient Planning under Uncertainty with Macro-actions

Deciding how to act in partially observable environments remains an active area of research. Identifying good sequences of decisions is particularly challenging when good control performance requires planning multiple steps into the future in domains with many states. Towards addressing this challenge, we present an online, forward-search algorithm called the Posterior Belief Distribution (PBD)...

Full text

Planning with macro-actions in decentralized POMDPs

Decentralized partially observable Markov decision processes (Dec-POMDPs) are general models for decentralized decision making under uncertainty. However, they typically model a problem at a low level of granularity, where each agent’s actions are primitive operations lasting exactly one time step. We address the case where each agent has macro-actions: temporally extended actions which may requ...

Full text

Online generation and use of macro-actions in forward-chaining planning

This thesis presents a technique for online learning and management of macro-actions in forward-chaining planning. Macro-actions are learnt on plateaux, areas of the search landscape where the heuristic cannot offer good search guidance, and are reused when future plateaux are encountered. Libraries of macro-actions are generated, storing macro-actions for use on future problems. Several strateg...

Full text


Journal:

Volume   Issue

Pages  -

Publication year: 2010